Abstract

This paper describes a novel technique for incorporating syntactic knowledge into phrase-based machine translation through incremental syntactic parsing. Bottom-up and top-down parsers typically require a completed string as input. This requirement makes it difficult to incorporate them into phrase-based translation, which generates partial hypothesized translations from left-to-right. Incremental syntactic language models score sentences in a similar left-to-right fashion, and are therefore a good mechanism for incorporating syntax into phrase-based translation. We give a formal definition of one such linear-time syntactic language model, detail its relation to phrase-based decoding, and integrate the model with the Moses phrase-based translation system. We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity.

1  Introduction 

Early work in statistical machine translation viewed translation as a noisy channel process comprised of a translation model, which functioned to posit adequate translations of source language words, and a target language model, which guided the fluency of generated target language strings (Brown et al., 1990). Drawing on earlier successes in speech recognition, research in statistical machine translation has effectively used n-gram word sequence models as language models.

Modern phrase-based translation using large scale n-gram language models generally performs well in terms of lexical choice, but still often produces ungrammatical output. Syntactic parsing may help produce more grammatical output by better modeling structural relationships and long-distance dependencies. Bottom-up and top-down parsers typically require a completed string as input; this requirement makes it difficult to incorporate these parsers into phrase-based translation, which generates hypothesized translations incrementally, from left-to-right.¹ As a workaround, parsers can rerank the translated output of translation systems (Och et al., 2004).

On the other hand, incremental parsers (Roark, 2001; Henderson, 2004; Schuler et al., 2010; Huang and Sagae, 2010) process input in a straightforward left-to-right manner. We observe that incremental parsers, used as structured language models, provide an appropriate algorithmic match to incremental phrase-based decoding. We directly integrate incremental syntactic parsing into phrase-based translation. This approach re-exerts the role of the language model as a mechanism for encouraging syntactically fluent translations.

The  contributions  of  this  work  are  as  follows: 

• A novel method for integrating syntactic LMs into phrase-based translation (§3)

• A formal definition of an incremental parser for statistical MT that can run in linear time (§4)

• Integration with Moses (§5) along with empirical results for perplexity and significant translation score improvement on a constrained Urdu-English task (§6)

¹ While not all languages are written left-to-right, we will refer to incremental processing which proceeds from the beginning of a sentence as left-to-right.

2  Related  Work 

Neither phrase-based (Koehn et al., 2003) nor hierarchical phrase-based translation (Chiang, 2005) takes explicit advantage of the syntactic structure of either source or target language. The translation models in these techniques define phrases as contiguous word sequences (with gaps allowed in the case of hierarchical phrases) which may or may not correspond to any linguistic constituent. Early work in statistical phrase-based translation considered whether restricting translation models to use only syntactically well-formed constituents might improve translation quality (Koehn et al., 2003) but found such restrictions failed to improve translation quality.

Significant research has examined the extent to which syntax can be usefully incorporated into statistical tree-based translation models: string-to-tree (Yamada and Knight, 2001; Gildea, 2003; Imamura et al., 2004; Galley et al., 2004; Graehl and Knight, 2004; Melamed, 2004; Galley et al., 2006; Huang et al., 2006; Shen et al., 2008), tree-to-string (Liu et al., 2006; Liu et al., 2007; Mi et al., 2008; Mi and Huang, 2008; Huang and Mi, 2010), tree-to-tree (Abeillé et al., 1990; Shieber and Schabes, 1990; Poutsma, 1998; Eisner, 2003; Shieber, 2004; Cowan et al., 2006; Nesson et al., 2006; Zhang et al., 2007; DeNeefe et al., 2007; DeNeefe and Knight, 2009; Liu et al., 2009; Chiang, 2010), and treelet (Ding and Palmer, 2005; Quirk et al., 2005) techniques use syntactic information to inform the translation model. Recent work has shown that parsing-based machine translation using syntax-augmented (Zollmann and Venugopal, 2006) hierarchical translation grammars with rich nonterminal sets can demonstrate substantial gains over hierarchical grammars for certain language pairs (Baker et al., 2009). In contrast to the above tree-based translation models, our approach maintains a standard (non-syntactic) phrase-based translation model. Instead, we incorporate syntax into the language model.

Traditional approaches to language models in speech recognition and statistical machine translation focus on the use of n-grams, which provide a simple finite-state model approximation of the target language. Chelba and Jelinek (1998) proposed that syntactic structure could be used as an alternative technique in language modeling. This insight has been explored in the context of speech recognition (Chelba and Jelinek, 2000; Collins et al., 2005). Hassan et al. (2007) and Birch et al. (2007) use supertag n-gram LMs. Syntactic language models have also been explored with tree-based translation models. Charniak et al. (2003) use syntactic language models to rescore the output of a tree-based translation system. Post and Gildea (2008) investigate the integration of parsers as syntactic language models during binary bracketing transduction translation (Wu, 1997); under these conditions, both syntactic phrase-structure and dependency parsing language models were found to improve oracle-best translations, but did not improve actual translation results. Post and Gildea (2009) use tree substitution grammar parsing for language modeling, but do not use this language model in a translation system. Our work, in contrast to the above approaches, explores the use of incremental syntactic language models in conjunction with phrase-based translation models.

Our syntactic language model fits into the family of linear-time dynamic programming parsers described in (Huang and Sagae, 2010). Like (Galley and Manning, 2009) our work implements an incremental syntactic language model; our approach differs by calculating syntactic LM scores over all available phrase-structure parses at each hypothesis instead of the 1-best dependency parse.

The syntax-driven reordering model of Ge (2010) uses syntax-driven features to influence word order within standard phrase-based translation. The syntactic cohesion features of Cherry (2008) encourage the use of syntactically well-formed translation phrases. These approaches are fully orthogonal to our proposed incremental syntactic language model, and could be applied in concert with our work.

3  Parser  as  Syntactic  Language  Model  in 
Phrase-Based  Translation 

Parsing is the task of selecting the representation $\hat{\tau}$ (typically a tree) that best models the structure of sentence $e$, out of all such possible representations $\boldsymbol{\tau}$. This set of representations may be all phrase structure trees or all dependency trees allowed by the parsing model. Typically, tree $\hat{\tau}$ is taken to be:

\[ \hat{\tau} = \operatorname*{argmax}_{\tau} P(\tau \mid e) \tag{1} \]

We define a syntactic language model $P(e)$ based on the total probability mass over all possible trees for string $e$. This is shown in Equation 2 and decomposed in Equation 3.

\[ P(e) = \sum_{\tau \in \boldsymbol{\tau}} P(\tau, e) \tag{2} \]

\[ P(e) = \sum_{\tau \in \boldsymbol{\tau}} P(e \mid \tau)\, P(\tau) \tag{3} \]

Figure 1: Partial decoding lattice for standard phrase-based decoding stack algorithm translating the German sentence Der Präsident trifft am Freitag den Vorstand. Each node $h$ in decoding stack $t$ represents the application of a translation option, and includes the source sentence coverage vector, target language n-gram state, and syntactic language model state $\tilde{\tau}_{t_h}$. Hypothesis combination is also shown, indicating where lattice paths with identical n-gram histories converge. We use the English translation The president meets the board on Friday as a running example throughout all Figures.


3.1  Incremental  syntactic  language  model 

An incremental parser processes each token of input sequentially from the beginning of a sentence to the end, rather than processing input in a top-down (Earley, 1968) or bottom-up (Cocke and Schwartz, 1970; Kasami, 1965; Younger, 1967) fashion. After processing the $t$th token in string $e$, an incremental parser has some internal representation of possible hypothesized (incomplete) trees, $\boldsymbol{\tau}_t$. The syntactic language model probability of a partial sentence $e_1 \ldots e_t$ is defined:

\[ P(e_1 \ldots e_t) = \sum_{\tau \in \boldsymbol{\tau}_t} P(e_1 \ldots e_t \mid \tau)\, P(\tau) \tag{4} \]

In practice, a parser may constrain the set of trees under consideration to $\tilde{\tau}_t$, that subset of analyses or partial analyses that remains after any pruning is performed. An incremental syntactic language model can then be defined by a probability mass function (Equation 5) and a transition function $\delta$ (Equation 6). The role of $\delta$ is explained in §3.3 below. Any parser which implements these two functions can serve as a syntactic language model.

\[ P(e_1 \ldots e_t) \approx P(\tilde{\tau}_t) = \sum_{\tau \in \tilde{\tau}_t} P(e_1 \ldots e_t \mid \tau)\, P(\tau) \tag{5} \]

\[ \delta(e_t, \tilde{\tau}_{t-1}) \rightarrow \tilde{\tau}_t \tag{6} \]
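
As an illustration of the two-function interface in Equations 5 and 6, the following is a minimal sketch and not the parser used in this paper. It assumes a toy model in which each "partial analysis" is just the most recent category, so the state reduces to a small HMM forward vector; the tables `TRANSITION` and `EMISSION`, the function names, and all numbers are hypothetical.

```python
from typing import Dict

# A parser state maps each surviving partial analysis to the joint probability
# of that analysis and the words consumed so far.
State = Dict[str, float]

# Hypothetical toy parameters (not from the paper): each "analysis" is only the
# most recent category, so the model behaves like a tiny HMM.
TRANSITION = {
    "<s>": {"DT": 0.7, "NN": 0.3},
    "DT": {"DT": 0.1, "NN": 0.9},
    "NN": {"DT": 0.4, "NN": 0.6},
}
EMISSION = {
    "DT": {"the": 0.9, "president": 0.1},
    "NN": {"the": 0.1, "president": 0.9},
}


def initial_state() -> State:
    """State before any word has been read."""
    return {"<s>": 1.0}


def prob(state: State) -> float:
    """Probability mass function (Eq. 5): sum over analyses that survived pruning."""
    return sum(state.values())


def delta(word: str, prev: State, beam: int = 2000) -> State:
    """Transition function (Eq. 6): advance every analysis by one word, then prune."""
    nxt: State = {}
    for prev_cat, mass in prev.items():
        for cat, p_trans in TRANSITION[prev_cat].items():
            p_emit = EMISSION[cat].get(word, 0.0)
            nxt[cat] = nxt.get(cat, 0.0) + mass * p_trans * p_emit
    # Keep only the `beam` most probable analyses.
    survivors = sorted(nxt.items(), key=lambda kv: kv[1], reverse=True)[:beam]
    return dict(survivors)


state = initial_state()
for w in ["the", "president"]:
    state = delta(w, state)
print(prob(state))
```

Any object exposing `prob` and `delta` in this way could, under the definition above, stand in as the syntactic language model; pruning with a small `beam` makes `prob` an underestimate of the unpruned sum in Equation 4.
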
3.2 Decoding in phrase-based translation

Given a source language input sentence $f$, a trained source-to-target translation model, and a target language model, the task of translation is to find the maximally probable translation $\hat{e}$ using a linear combination of $j$ feature functions $h$ weighted according to tuned parameters $\lambda$ (Och and Ney, 2002).

\[ \hat{e} = \operatorname*{argmax}_{e} \exp\Big(\sum_{j} \lambda_j h_j(e, f)\Big) \tag{7} \]
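
Equation 7 can be sketched over an explicit candidate list as follows. This is only an illustration: a real decoder searches a lattice rather than enumerating candidates, and the feature names (`tm`, `ngram_lm`, `syn_lm`), feature values, and weights below are hypothetical.

```python
import math
from typing import Dict


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """exp(sum_j lambda_j * h_j(e, f)) for one candidate translation e."""
    return math.exp(sum(weights[name] * value for name, value in features.items()))


def best_translation(candidates: Dict[str, Dict[str, float]],
                     weights: Dict[str, float]) -> str:
    """Return the candidate with the highest log-linear score (the argmax in Eq. 7)."""
    return max(candidates, key=lambda e: loglinear_score(candidates[e], weights))


# Hypothetical feature values (log probabilities) for two candidate outputs.
candidates = {
    "the president meets the board": {"tm": -2.0, "ngram_lm": -5.0, "syn_lm": -4.0},
    "president the meets board the": {"tm": -2.0, "ngram_lm": -9.0, "syn_lm": -12.0},
}
weights = {"tm": 1.0, "ngram_lm": 0.5, "syn_lm": 0.4}
print(best_translation(candidates, weights))
```

Because the exponential is monotonic, the argmax is unchanged if the weighted sum is compared directly; the exponential is kept here only to mirror the equation.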

Phrase-based translation constructs a set of translation options (hypothesized translations for contiguous portions of the source sentence) from a trained phrase table, then incrementally constructs a lattice of partial target translations (Koehn, 2010). To prune the search space, lattice nodes are organized into beam stacks (Jelinek, 1969) according to the number of source words translated. An n-gram language model history is also maintained at each node in the translation lattice. The search space is further trimmed with hypothesis recombination, which collapses lattice nodes that share a common coverage vector and n-gram state.
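
The stack organization and hypothesis recombination just described can be sketched as below. This is a simplified illustration rather than the Moses implementation: the `Hypothesis` class and its fields are hypothetical, and keeping the higher-scoring hypothesis on a collision is an assumption based on common practice.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Hypothesis:
    coverage: Tuple[bool, ...]     # which source words have been translated
    ngram_state: Tuple[str, ...]   # last n-1 target words
    score: float                   # model score so far (higher is better)
    words: Tuple[str, ...]         # target words produced so far


def stack_index(hyp: Hypothesis) -> int:
    """Hypotheses are grouped into stacks by the number of source words covered."""
    return sum(hyp.coverage)


def recombine(hyps: List[Hypothesis]) -> List[Hypothesis]:
    """Collapse hypotheses sharing a coverage vector and n-gram state, keeping the best."""
    best: Dict[Tuple[Tuple[bool, ...], Tuple[str, ...]], Hypothesis] = {}
    for hyp in hyps:
        key = (hyp.coverage, hyp.ngram_state)
        if key not in best or hyp.score > best[key].score:
            best[key] = hyp
    return list(best.values())


a = Hypothesis((True, True, False), ("the", "president"), -3.2, ("the", "president"))
b = Hypothesis((True, True, False), ("the", "president"), -4.1, ("so", "the", "president"))
c = Hypothesis((True, False, False), ("the",), -1.0, ("the",))
print(len(recombine([a, b, c])))
```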

3.3  Incorporating  a  Syntactic  Language  Model 

Phrase-based translation produces target language words in an incremental left-to-right fashion, generating words at the beginning of a translation first and words at the end of a translation last. Similarly, incremental parsers process sentences in an incremental fashion, analyzing words at the beginning of a sentence first and words at the end of a sentence last. As such, an incremental parser with transition function $\delta$ can be incorporated into the phrase-based decoding process in a straightforward manner. Each node in the translation lattice is augmented with a syntactic language model state $\tilde{\tau}_t$.

The hypothesis at the root of the translation lattice is initialized with $\tilde{\tau}_0$, representing the internal state of the incremental parser before any input words are processed. The phrase-based translation decoding process adds nodes to the lattice; each new node contains one or more target language words. Each node contains a backpointer to its parent node, in which $\tilde{\tau}_{t-1}$ is stored. Given a new target language word $e_t$ and $\tilde{\tau}_{t-1}$, the incremental parser's transition function $\delta$ calculates $\tilde{\tau}_t$. Figure 1 illustrates a sample phrase-based decoding lattice where each translation lattice node is augmented with syntactic language model state $\tilde{\tau}_t$.

(S (NP (DT The) (NN president)) (VP (VP (VB meets) (NP (DT the) (NN board))) (PP (IN on) (NP Friday))))

Figure 2: Sample binarized phrase structure tree.

Figure 3: Sample binarized phrase structure tree after application of right-corner transform.

In phrase-based translation, many translation lattice nodes represent multi-word target language phrases. For such translation lattice nodes, $\delta$ will be called once for each newly hypothesized target language word in the node. Only the final syntactic language model state in such sequences need be stored in the translation lattice node.
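
The per-word application of the transition function to a multi-word translation option can be sketched as follows. This is a schematic under stated assumptions: `LatticeNode`, `extend`, and `toy_delta` are hypothetical names, and the toy state simply records the words seen so far in place of a real parser state.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Stand-in for the parser state; a real state would hold pruned partial analyses.
SyntacticState = Tuple[str, ...]


@dataclass
class LatticeNode:
    words: Tuple[str, ...]                   # target words added by this node
    syntactic_state: SyntacticState          # state after the node's final word
    parent: Optional["LatticeNode"] = None   # backpointer to the previous node


def toy_delta(word: str, prev: SyntacticState) -> SyntacticState:
    """Hypothetical transition function: append the word to the state."""
    return prev + (word,)


def extend(parent: LatticeNode,
           phrase: Sequence[str],
           delta: Callable[[str, SyntacticState], SyntacticState] = toy_delta) -> LatticeNode:
    """Apply delta once per new target word; store only the final state."""
    state = parent.syntactic_state
    for word in phrase:
        state = delta(word, state)  # intermediate states are overwritten, not kept
    return LatticeNode(words=tuple(phrase), syntactic_state=state, parent=parent)


root = LatticeNode(words=(), syntactic_state=())
node = extend(extend(extend(root, ["The", "president"]), ["meets"]), ["the", "board"])
print(node.syntactic_state)
```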

4  Incremental  Bounded-Memory  Parsing 
with  a  Time  Series  Model 

Having defined the framework by which any incremental parser may be incorporated into phrase-based translation, we now formally define a specific incremental parser for use in our experiments.

The parser must process target language words incrementally as the phrase-based decoder adds hypotheses to the translation lattice. To facilitate this incremental processing, ordinary phrase-structure trees can be transformed into right-corner recursive phrase structure trees using the tree transforms in Schuler et al. (2010). Constituent nonterminals in right-corner transformed trees take the form of incomplete constituents $c_\eta / c_{\eta\iota}$ consisting of an 'active' constituent $c_\eta$ lacking an 'awaited' constituent $c_{\eta\iota}$ yet to come, similar to non-constituent categories in a Combinatory Categorial Grammar (Ades and Steedman, 1982; Steedman, 2000). As an example, the parser might consider VP/NN as a possible category for input "meets the".

A sample phrase structure tree is shown before and after the right-corner transform in Figures 2 and 3. Our parser operates over a right-corner transformed probabilistic context-free grammar (PCFG). Parsing runs in linear time in the length of the input. This model of incremental parsing is implemented as a Hierarchical Hidden Markov Model (HHMM) (Murphy and Paskin, 2001), and is equivalent to a probabilistic pushdown automaton with a bounded pushdown store. The parser runs in O(n) time, where n is the number of words in the input. This model is shown graphically in Figure 4 and formally defined in §4.1 below.

The incremental parser assigns a probability (Eq. 5) for a partial target language hypothesis, using a bounded store of incomplete constituents $c_\eta / c_{\eta\iota}$. The phrase-based decoder uses this probability value as the syntactic language model feature score.

Figure 4: Graphical representation of the dependency structure in a standard Hierarchical Hidden Markov Model with D = 3 hidden levels that can be used to parse syntax. Circles denote random variables, and edges denote conditional dependencies. Shaded circles denote variables with observed values.

4.1 Formal Parsing Model: Scoring Partial Translation Hypotheses

This model is essentially an extension of an HHMM, which obtains a most likely sequence of hidden store states, $\hat{s}^{1..D}_{1..T}$, of some length $T$ and some maximum depth $D$, given a sequence of observed tokens (e.g. generated target language words), $e_{1..T}$, using HHMM state transition model $\theta_A$ and observation symbol model $\theta_B$ (Rabiner, 1990):

\[ \hat{s}^{1..D}_{1..T} \overset{\text{def}}{=} \operatorname*{argmax}_{s^{1..D}_{1..T}} \prod_{t=1}^{T} P_{\theta_A}(s^{1..D}_t \mid s^{1..D}_{t-1}) \cdot P_{\theta_B}(e_t \mid s^{1..D}_t) \tag{8} \]

The HHMM parser is equivalent to a probabilistic pushdown automaton with a bounded pushdown store. The model generates each successive store (using store model $\theta_S$) only after considering whether each nested sequence of incomplete constituents has completed and reduced (using reduction model $\theta_R$):

\[ P_{\theta_A}(s^{1..D}_t \mid s^{1..D}_{t-1}) \overset{\text{def}}{=} \sum_{r^1_t \ldots r^D_t} \prod_{d=1}^{D} P_{\theta_R}(r^d_t \mid r^{d+1}_t\, s^d_{t-1}\, s^{d-1}_{t-1}) \cdot P_{\theta_S}(s^d_t \mid r^{d+1}_t\, r^d_t\, s^d_{t-1}\, s^{d-1}_t) \tag{9} \]


Store elements are defined to contain only the active ($c_\eta$) and awaited ($c_{\eta\iota}$) constituent categories necessary to compute an incomplete constituent probability:

\[ s^d_t \overset{\text{def}}{=} \langle c_\eta, c_{\eta\iota} \rangle \tag{10} \]

Reduction states are defined to contain only the complete constituent category $c_{r^d_t}$ necessary to compute an inside likelihood probability, as well as a flag $f_{r^d_t}$ indicating whether a reduction has taken place (to end a sequence of incomplete constituents):

\[ r^d_t \overset{\text{def}}{=} \langle c_{r^d_t}, f_{r^d_t} \rangle \tag{11} \]

The model probabilities for these store elements and reduction states can then be defined (from Murphy and Paskin 2001) to expand a new incomplete constituent after a reduction has taken place ($f_{r^d_t} = 1$; using depth-specific store state expansion model $\theta_{S\text{-}E,d}$), transition along a sequence of store elements if no reduction has taken place ($f_{r^d_t} = 0$; using depth-specific store state transition model $\theta_{S\text{-}T,d}$),² and possibly reduce a store element (terminate a sequence) if the store state below it has reduced ($f_{r^{d+1}_t} = 1$; using depth-specific reduction model $\theta_{R,d}$):

\[ P_{\theta_R}(r^d_t \mid r^{d+1}_t\, s^d_{t-1}\, s^{d-1}_{t-1}) \overset{\text{def}}{=} \begin{cases} \text{if } f_{r^{d+1}_t} = 0: & [\![ r^d_t = r_\bot ]\!] \\ \text{if } f_{r^{d+1}_t} = 1: & P_{\theta_{R,d}}(r^d_t \mid r^{d+1}_t\, s^d_{t-1}\, s^{d-1}_{t-1}) \end{cases} \tag{13} \]

where $r_\bot$ is a null state resulting from the failure of an incomplete constituent to complete, and constants are defined for the edge conditions of $s^0_t$ and $r^{D+1}_t$. Figure 5 illustrates this model in action.

Figure 5: Graphical representation of the Hierarchical Hidden Markov Model after parsing input sentence The president meets the board on Friday. The shaded path through the parse lattice illustrates the recognized right-corner tree structure of Figure 3.

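These pushdown automaton operations are then refined for right-corner parsing (Schuler, 2009), distinguishing active transitions (model $\theta_{S\text{-}T\text{-}A,d}$, in which an incomplete constituent is completed, but not reduced, and then immediately expanded to a new incomplete constituent in the same store element) from awaited transitions (model $\theta_{S\text{-}T\text{-}W,d}$, which involve no completion):

\[ P_{\theta_{S\text{-}T,d}}(s^d_t \mid r^{d+1}_t\, r^d_t\, s^d_{t-1}\, s^{d-1}_t) \overset{\text{def}}{=} \begin{cases} \text{if } r^d_t \neq r_\bot: & P_{\theta_{S\text{-}T\text{-}A,d}}(s^d_t \mid s^{d-1}_t\, r^d_t) \\ \text{if } r^d_t = r_\bot: & P_{\theta_{S\text{-}T\text{-}W,d}}(s^d_t \mid s^d_{t-1}\, r^{d+1}_t) \end{cases} \tag{14} \]

\[ P_{\theta_{R,d}}(r^d_t \mid r^{d+1}_t\, s^d_{t-1}\, s^{d-1}_{t-1}) \overset{\text{def}}{=} \begin{cases} \text{if } c_{r^{d+1}_t} \neq x_t: & [\![ r^d_t = r_\bot ]\!] \\ \text{if } c_{r^{d+1}_t} = x_t: & P_{\theta_{R\text{-}R,d}}(r^d_t \mid s^d_{t-1}\, s^{d-1}_{t-1}) \end{cases} \tag{15} \]

² An indicator function $[\![\cdot]\!]$ is used to denote deterministic probabilities: $[\![\phi]\!] = 1$ if $\phi$ is true, 0 otherwise.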

These HHMM right-corner parsing operations are then defined in terms of branch- and depth-specific PCFG probabilities $\theta_{G\text{-}R,d}$ and $\theta_{G\text{-}L,d}$.³

³ Model probabilities are also defined in terms of left-progeny probability distribution $E_{\theta_{G\text{-}RL^*,d}}$, which is itself defined in terms of PCFG probabilities:

^®g-rl *a(Cri  Cr]0  "■)  =  ^ea-ti.,d(cv  cv0  Ciji)  (16) 

crjl 

E®g-rl Cv0k0  ■■')  =  ^ZE0G_RR,d(cv  ->  Cv0k  ...) 

Cr,Ok 
OO
“  EeG-RL*d(C?7  “ ' '  0,11  — )

(19)


Figure 6: A hypothesis in the phrase-based decoding lattice from Figure 1 is expanded using translation option the board of source phrase den Vorstand. Syntactic language model state $\tilde{\tau}_{3_1}$ contains random variables $s^{1..3}_3$; likewise $\tilde{\tau}_{5_1}$ contains $s^{1..3}_5$. The intervening random variables $r^{1..3}_4$, $s^{1..3}_4$, and $r^{1..3}_5$ are calculated by transition function $\delta$ (Eq. 6, as defined by §4.1), but are not stored. Observed random variables ($e_3..e_5$) are shown for clarity, but are not explicitly stored in any syntactic language model state.
coder’s hypothesis stacks. Figure 1 illustrates an excerpt from a standard phrase-based translation lattice. Within each decoder stack $t$, each hypothesis $h$ is augmented with a syntactic language model state $\tilde{\tau}_{t_h}$. Each syntactic language model state is a random variable store, containing a slice of random variables from the HHMM. Specifically, $\tilde{\tau}_{t_h}$ contains those random variables $s^{1..D}_t$ that maintain distributions over syntactic elements.

By maintaining these syntactic random variable stores, each hypothesis has access to the current language model probability for the partial translation ending at that hypothesis, as calculated by an incremental syntactic language model defined by the HHMM. Specifically, the random variable store at hypothesis $h$ provides $P(\tilde{\tau}_{t_h}) = P(e^h_{1..t}, s^{1..D}_t)$, where $e^h_{1..t}$ is the sequence of words in a partial hypothesis ending at $h$ which contains $t$ target words, and where there are $D$ syntactic random variables in each random variable store (Eq. 5).

During stack decoding, the phrase-based decoder progressively constructs new hypotheses by extending existing hypotheses. New hypotheses are placed in appropriate hypothesis stacks. In the simplest case, a new hypothesis extends an existing hypothesis by exactly one target word. As the new hypothesis is constructed by extending an existing stack element, the store and reduction state random variables are processed, along with the newly hypothesized word. This results in a new store of syntactic random variables (Eq. 6) that are associated with the new stack element.

When a new hypothesis extends an existing hypothesis by more than one word, this process is first carried out for the first new word in the hypothesis. It is then repeated for the remaining words in the hypothesis extension. Once the final word in the hypothesis has been processed, the resulting random variable store is associated with that hypothesis. The random variable stores created for the non-final words in the extending hypothesis are discarded, and need not be explicitly retained.

Figure 6 illustrates this process, showing how a syntactic language model state $\tilde{\tau}_{5_1}$ in a phrase-based decoding lattice is obtained from a previous syntactic language model state $\tilde{\tau}_{3_1}$ (from Figure 1) by parsing the target language words from a phrase-based translation option.
WSJ HHMM
Interpolated WSJ + WSJ HHMM         222.39    123.10
Interp. Giga 5-gr + WSJ 5-gram      321.05

Figure 7: Average per-word perplexity values. HHMM was run with beam size of 2000. Bold indicates best single-model results for LMs trained on WSJ sections 2-21. Best overall in italics.


Our syntactic language model is integrated into the current version of Moses (Koehn et al., 2007).


6  Results 


As an initial measure to compare language models, average per-word perplexity, ppl, reports how surprised a model is by test data. Equation 25 calculates ppl using log base $b$ for a test set of $T$ tokens.

\[ \mathrm{ppl} = b^{\frac{-\log_b P(e_1 \ldots e_T)}{T}} \tag{25} \]
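
Equation 25 can be computed as in the sketch below. It assumes the model supplies one conditional log probability per token, so that their sum equals $\log_b P(e_1 \ldots e_T)$ by the chain rule; the function name and the example values are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float], base: float = 2.0) -> float:
    """Average per-word perplexity: base ** (-(log_base P(e_1..e_T)) / T)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    total_log_prob = sum(token_log_probs)  # log_base P(e_1..e_T) via the chain rule
    return base ** (-total_log_prob / len(token_log_probs))


# Hypothetical example: a model that assigns probability 1/8 to each of 4 tokens.
print(perplexity([math.log2(1 / 8)] * 4, base=2.0))
```

The result does not depend on the choice of base, provided the log probabilities and the exponentiation use the same one.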


We trained the syntactic language model from §4 (HHMM) and an interpolated n-gram language model with modified Kneser-Ney smoothing (Chen and Goodman, 1998); models were trained on sections 2-21 of the Wall Street Journal (WSJ) treebank (Marcus et al., 1993). The HHMM outperforms the n-gram model in terms of out-of-domain test set perplexity when trained on the same WSJ data; the best perplexity results for in-domain and out-of-domain test sets⁴ are found by interpolating HHMM and n-gram LMs (Figure 7). To show the effects of training an LM on more data, we also report perplexity results on the 5-gram LM trained for the GALE Arabic-English task using the English Gigaword corpus. In all cases, including the HHMM significantly reduces perplexity.

⁴ In-domain is WSJ Section 23. Out-of-domain are the English reference translations of the dev section, set aside in (Baker et al., 2009) for parameter tuning, of the NIST Open MT 2008 Urdu-English task.

Sentence length    Moses    +HHMM (beam=50)    +HHMM (beam=2000)
10                 0.21     533                1143
20                 0.53     1193               2562
30                 0.85     1746               3749
40                 1.13     2095               4588

Figure 8: Mean per-sentence decoding time (in seconds) for dev set using Moses with and without syntactic language model. HHMM parser beam sizes are indicated for the syntactic LM.
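
The text reports that interpolating HHMM and n-gram LMs gives the best perplexity but does not state the interpolation scheme. The sketch below assumes simple per-word linear interpolation with a single weight `lam`; that choice, the function names, and the probabilities are all hypothetical and serve only to show why two complementary models can yield lower perplexity together than either alone.

```python
import math
from typing import Sequence


def interpolate(p_ngram: float, p_hhmm: float, lam: float) -> float:
    """Linear interpolation of two per-word probabilities (assumed scheme)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_ngram + (1.0 - lam) * p_hhmm


def interpolated_perplexity(ngram_probs: Sequence[float],
                            hhmm_probs: Sequence[float],
                            lam: float) -> float:
    """Per-word perplexity (base 2) of the interpolated model on one token sequence."""
    if len(ngram_probs) != len(hhmm_probs) or not ngram_probs:
        raise ValueError("need two non-empty sequences of equal length")
    log_total = sum(math.log2(interpolate(a, b, lam))
                    for a, b in zip(ngram_probs, hhmm_probs))
    return 2.0 ** (-log_total / len(ngram_probs))


# Hypothetical per-word probabilities where each model is confident on a different word.
ngram = [0.5, 0.01]
hhmm = [0.01, 0.5]
print(interpolated_perplexity(ngram, hhmm, 1.0),   # n-gram only
      interpolated_perplexity(ngram, hhmm, 0.0),   # HHMM only
      interpolated_perplexity(ngram, hhmm, 0.5))   # equal mixture
```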

We trained a phrase-based translation model on the full training data for the NIST Open MT08 Urdu-English translation task. We trained the HHMM and n-gram LMs on the WSJ data in order to make them as similar as possible. During tuning, Moses was first configured to use just the n-gram LM, then configured to use both the n-gram LM and the syntactic HHMM LM. MERT consistently assigned positive weight to the syntactic LM feature, typically slightly less than the n-gram LM weight.

In our integration with Moses, incorporating a syntactic language model dramatically slows the decoding process. Figure 8 illustrates a slowdown of around three orders of magnitude. Although decoding time remains roughly linear in the length of the source sentence (ruling out exponential behavior), it carries an extremely large constant factor. Due to this slowdown, we tuned the parameters using a constrained dev set (only sentences with 1-20 words), and tested using a constrained devtest set (only sentences with 1-20 words). Figure 9 shows a statistically significant improvement to the BLEU score when using the HHMM and the n-gram LMs together on this reduced test set.
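
The size of the slowdown can be checked directly from the decoding times reported in Figure 8. The short sketch below copies those values and computes the ratio of each HHMM configuration to plain Moses; the variable and function names are ours, chosen for illustration.

```python
import math
from typing import Dict, Tuple

# Figure 8 values: sentence length -> (Moses, +HHMM beam=50, +HHMM beam=2000), in seconds.
DECODE_SECONDS: Dict[int, Tuple[float, float, float]] = {
    10: (0.21, 533.0, 1143.0),
    20: (0.53, 1193.0, 2562.0),
    30: (0.85, 1746.0, 3749.0),
    40: (1.13, 2095.0, 4588.0),
}


def slowdown_factors(times: Dict[int, Tuple[float, float, float]]) -> Dict[int, Tuple[float, float]]:
    """Return {length: (beam=50 ratio, beam=2000 ratio)} relative to plain Moses."""
    return {n: (b50 / base, b2000 / base) for n, (base, b50, b2000) in times.items()}


for length, ratios in slowdown_factors(DECODE_SECONDS).items():
    # log10 of each ratio gives the slowdown in orders of magnitude.
    print(length, [round(math.log10(r), 2) for r in ratios])
```

Every ratio falls between 10^3 and 10^4, which is consistent with the description of the slowdown as around three orders of magnitude.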

Moses LM(s)        BLEU
n-gram only        18.78
HHMM + n-gram      19.78

Figure 9: Results for Ur-En devtest (only sentences with 1-20 words) with HHMM beam size of 2000 and Moses settings of distortion limit 10, stack size 200, and ttable limit 20.

7 Discussion

This paper argues that incremental syntactic language models are a straightforward and appropriate algorithmic fit for incorporating syntax into phrase-based statistical machine translation, since both process sentences in an incremental left-to-right fashion. This means incremental syntactic LM scores can be calculated during the decoding process, rather than waiting until a complete sentence is posited, which is typically necessary in top-down or bottom-up parsing.

We provided a rigorous formal definition of incremental syntactic language models, and detailed what steps are necessary to incorporate such LMs into phrase-based decoding. We integrated an incremental syntactic language model into Moses. The translation quality significantly improved on a constrained task, and the perplexity improvements suggest that interpolating between n-gram and syntactic LMs may hold promise on larger data sets.

The use of very large n-gram language models is typically a key ingredient in the best-performing machine translation systems (Brants et al., 2007). Our n-gram model trained only on WSJ is admittedly small. Our future work seeks to incorporate large-scale n-gram language models in conjunction with incremental syntactic language models.

The added decoding time cost of our syntactic language model is very high. By increasing the beam size and distortion limit of the baseline system, future work may examine whether a baseline system with comparable runtimes can achieve comparable translation quality.

A more efficient implementation of the HHMM parser would speed decoding and make more extensive and conclusive translation experiments possible. Various additional improvements could include caching the HHMM LM calculations, and exploiting properties of the right-corner transform that limit the number of decisions between successive time steps.